Skip to content

ENH: Add ORC reader #29447

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 19 commits into from
Dec 11, 2019
Merged

ENH: Add ORC reader #29447

merged 19 commits into from
Dec 11, 2019

Conversation

kkraus14
Copy link
Contributor

@kkraus14 kkraus14 commented Nov 6, 2019

Added an ORC reader following the read_parquet API. Still need to give some additional love to the docstrings but this is at least ready for some discussion and eyes on it.

@kkraus14
Copy link
Contributor Author

kkraus14 commented Nov 6, 2019

Note: could also likely use some work to share the functionality between parquet and ORC readers where possible. There's a lot of copy paste currently in this PR.

@alimcmaster1 alimcmaster1 added the IO Data IO issues that don't fit into a more specific label label Nov 6, 2019
@kkraus14
Copy link
Contributor Author

I unfortunately am not going to have the bandwidth to drive this to completion likely for a while, so if someone would like to take this over it would be welcome.

@jreback
Copy link
Contributor

jreback commented Nov 17, 2019

ok let's see what we can do here

@jreback jreback added this to the 1.0 milestone Nov 17, 2019
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

pandas/io/orc.py Outdated
DataFrame
"""

impl = get_engine(engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just out of curiosity what other readers do you see implementing here? I assume a lot of this extra code is for config management to return the PyArrowImpl

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nothing, can prob simply this

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you still want to do this but commenting for context of next few comments

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would personally simplify this a lot by removing the class.
If at some point we actually want to have different engines, we can always use the approach of the parquet code then.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor things not blockers for me. Looks good overall

pandas/io/orc.py Outdated
DataFrame
"""

impl = get_engine(engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if you still want to do this but commenting for context of next few comments

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments.

For test testing files, did you write them yourself, or do they come from somewhere? I was wondering if you could make the > 50KB ones a bit smaller.

for data frames. It is designed to make reading data frames efficient. Pandas provides *only* a reader for the
ORC format, :func:`~pandas.read_orc`.

See the documentation for `pyarrow <https://arrow.apache.org/docs/python/>`__ for more.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is actually no documentation about ORC there, so that link is not very helpful ...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you mention this reader needs pyarrow to be installed instead?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see 2 lines above about the ORC format

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs are a bit sparse on the pyarrow side.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i was just looking through this PR to see who to ask about pyarrow.orc docs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

iirc @jorisvandenbossche did a PR in arrow side for this

pandas/io/orc.py Outdated
DataFrame
"""

impl = get_engine(engine)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would personally simplify this a lot by removing the class.
If at some point we actually want to have different engines, we can always use the approach of the parquet code then.

@jreback
Copy link
Contributor

jreback commented Dec 2, 2019

Added a few comments.

For test testing files, did you write them yourself, or do they come from somewhere? I was wondering if you could make the > 50KB ones a bit smaller.

IIRC @kkraus14 mentioned these were externally generated; these are pretty small so I don't think this is a big deal (and they are skipped when we generate the stripped builds anyhow (e.g. remove tests/data)

@jreback
Copy link
Contributor

jreback commented Dec 3, 2019

updated

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIRC @kkraus14 mentioned these were externally generated; these are pretty small so I don't think this is a big deal (and they are skipped when we generate the stripped builds anyhow (e.g. remove tests/data)

OK!

@jreback
Copy link
Contributor

jreback commented Dec 10, 2019

this is good to go @jorisvandenbossche @WillAyd

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me!

(just 2 minor doc comments that can be applied online)

pandas/io/orc.py Outdated
By file-like object, we refer to objects with a ``read()`` method,
such as a file handler (e.g. via builtin ``open`` function)
or ``StringIO``.
columns : list, default=None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
columns : list, default=None
columns : list, default None

pandas/io/orc.py Outdated
DataFrame
"""

# we require a newer version of pyarrow thaN we support for parquet
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# we require a newer version of pyarrow thaN we support for parquet
# we require a newer version of pyarrow than we support for parquet

@jreback
Copy link
Contributor

jreback commented Dec 10, 2019

@jorisvandenbossche wow I can't commit these on-line (likely because this is owned by @kkraus14 )

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. very nice implementation

@jorisvandenbossche jorisvandenbossche changed the title Add ORC reader ENH: Add ORC reader Dec 11, 2019
@jorisvandenbossche jorisvandenbossche merged commit 9b202d3 into pandas-dev:master Dec 11, 2019
@jorisvandenbossche
Copy link
Member

Thanks @kkraus14 and @jreback !

wow I can't commit these on-line (likely because this is owned by @kkraus14 )

Strange, as the branch is open to push commits by maintainers ..

jreback added a commit that referenced this pull request Dec 11, 2019
@jreback
Copy link
Contributor

jreback commented Dec 11, 2019

lgtm. very nice implementation

actually this was on my office computer; we block push so likely that’s the issue; thanks for merge

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
@kkraus14
Copy link
Contributor Author

Apologies that I didn't have the bandwidth to drive this to completion. Thanks @jreback and @jorisvandenbossche for driving this to completion.

@jreback
Copy link
Contributor

jreback commented Dec 23, 2019

thanks for putting this up @kkraus14 !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DISCUSS: What would an ORC reader/writer API look like?
6 participants